Abstract


In many multi-agent learning problems, it is difficult to determine, a priori, the agent reward structure that will lead to good performance. This problem is particularly pronounced in continuous, noisy domains ill-suited to the simple table backup schemes commonly used in TD(λ)/Q-learning. In this paper, we present a new reward evaluation method that allows the tradeoff between coordination among the agents and the difficulty of the learning problem each agent faces to be visualized. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We then use this reward efficiency visualization method to determine an effective reward without performing extensive simulations. We test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and where their actions are noisy (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a speedup of two orders of magnitude in selecting a good reward. Most importantly, it allows one to quickly create and verify rewards tailored to the observational limitations of the domain.


1. Introduction 

Recent advances in distributed learning methods have addressed how to best create rewards that promote coordination in a multi-agent system [?, 8, 9, 11, 13]. This is a fundamental challenge that applies to most multi-agent learning problems, but particularly to learning in dynamic environments. Indeed, many coordination methods that perform well in static environments perform poorly in dynamic environments [?]. In this paper, we present a reward evaluation method that directly addresses this issue by explicitly visualizing the coordination properties of a reward in both static and dynamic environments. This reward evaluation method is based on two important properties in multi-agent coordination:

1. how well the reward promotes coordination among 
agents in different parts of a domain’s state-space; and 

2. how easy it is for an agent to learn to maximize that 
reward. 

This reward visualization method provides the ability to predict the performance of a reward in a given domain without the need for lengthy learning trials. Furthermore, this method can be used to create either new sets of coordination mechanisms or new reward structures based on the specific needs of the domain.

We explore agent reward design and evaluation in a continuous rover problem where a set of rovers learn to navigate and collect information in an unknown environment based on their noisy sensor inputs [2]. Reinforcement learning and credit assignment are particularly challenging in this case because traditional table-based reinforcement learning methods such as Q-learning, TD(λ) and Sarsa learners are ill-suited to this domain [10]. Instead, we select a direct policy search method where the full control policy is evaluated after each learning episode. Note that this domain is not only more realistic but also significantly more difficult than previous multi-rover coordination problems where agents learned to take discrete actions in a static grid-world setting [11]. Therefore, having well-tailored and computationally tractable agent rewards is particularly important in this domain.

In this paper we provide evaluation and visualization methods for multi-agent coordination problems in noisy domains with continuous state spaces. We discuss three types of agent rewards that vary in how well they promote coordination and how easy it is for the agents to learn them. A new visualization method is then used to determine which reward is best suited to the Continuous Rover Problem. In addition, the visualization is used to provide new agent rewards that take the rovers' partial observation limitations into account while retaining many of the salient features (e.g., coordination) of the full reward. Section 2 describes the key reward properties required for evaluating agent rewards and discusses three types of rewards. Section 3 presents the Continuous Rover Problem and provides the simulation details. Section 4 presents the visualization results that allow the evaluation of the rewards, and Section 5 presents the simulation results.

2. Rewards for Agent Coordination 

In this work, we focus on cooperative multi-agent systems where each agent $i$ is taking actions to maximize its own agent reward $g_i$, and where the performance of the full system is measured by the global reward $G$. The system state $z$ is decomposed into a component that depends on the state of agent $i$, denoted by $z_i$, and a component that does not depend on the state of agent $i$, denoted by $z_{-i}$. (We will use the notation $z = z_i + z_{-i}$ to concatenate the state vectors.) Note that though agent $i$ may or may not influence the full state $z$, both $G$ and $g_i$ are functions of $z$, the full state of the system.

2.1. Factoredness and Learnability 

There are two properties that are crucial to producing cooperative multi-agent systems in which agents acting to optimize their own agent rewards will also optimize the provided global reward. The first, called factoredness, concerns “aligning” the agent rewards with the global reward. For an agent $i$, let us define the degree of factoredness between the rewards $g_i$ and $G$ at point $z$ as:

$$F_{g_i}(z) = \frac{\sum_{z'} u[(g_i(z) - g_i(z'))(G(z) - G(z'))]}{\sum_{z'} 1} \qquad (1)$$

where the states $z$ and $z'$ only differ in the states of agent $i$, and $u[x]$ is the unit step function, equal to 1 if $x > 0$. Intuitively, the degree of factoredness gives the percentage of states in which a change in the action of agent $i$ has the same impact on $g_i$ and $G$. A high degree of factoredness means that the agent reward $g_i$ is aligned with the global reward $G$. As a trivial example, any system in which all the agent rewards equal $G$ has a degree of factoredness of 1.
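The degree of factoredness can be estimated by sampling, as the following minimal sketch suggests. It is not the authors' implementation: the function names, the tuple representation of states, and the toy rewards are all assumptions made for illustration.

```python
def unit_step(x):
    """u[x] from the factoredness definition: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def degree_of_factoredness(g_i, G, z, alternatives):
    """Sampled estimate of the degree of factoredness at state z.

    `alternatives` holds states z' that differ from z only in agent i's
    component; the caller is responsible for generating them.
    """
    if not alternatives:
        raise ValueError("need at least one alternative state")
    aligned = sum(
        unit_step((g_i(z) - g_i(zp)) * (G(z) - G(zp))) for zp in alternatives
    )
    return aligned / len(alternatives)


# Toy example (hypothetical): three agents with scalar states; agent 0 varies.
G = lambda z: sum(z)          # global reward
g_aligned = lambda z: z[0]    # moves with G when agent 0 changes
g_opposed = lambda z: -z[0]   # moves against G when agent 0 changes

z = (0.5, 0.2, 0.9)
alternatives = [(v, 0.2, 0.9) for v in (0.0, 0.1, 0.7, 1.0)]

print(degree_of_factoredness(g_aligned, G, z, alternatives))
print(degree_of_factoredness(g_opposed, G, z, alternatives))
```

In this toy setting the aligned reward agrees with $G$ on every sampled alternative and the opposed reward on none, which mirrors the "+" and "-" regions discussed later in the visualizations.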

The second property, called learnability, measures the dependence of a reward on the actions of a particular agent as opposed to all the other agents. Let us first define the point learnability of reward $g_i$ between states $z$ and $z'$ as the ratio of the change in $g_i$ due to a change in the state of agent $i$ over the change in $g_i$ due to a change in the states of the other agents:

$$L(g_i, z, z') = \frac{\| g_i(z) - g_i(z - z_i + z'_i) \|}{\| g_i(z) - g_i(z' - z'_i + z_i) \|} \qquad (2)$$

where $z'$ is an alternate to state $z$ (e.g., in the numerator of Eq. 2, agent $i$'s state is changed from $z_i$ to $z'_i$, whereas in the denominator, the states of all other agents are changed from $z_{-i}$ to $z'_{-i}$). The learnability of a reward $g_i$ is then given by:


$$L(g_i, z) = \frac{\sum_{z'} L(g_i, z, z')}{\sum_{z'} 1} \qquad (3)$$


Intuitively, the higher the learnability, the more $g_i$ depends on the move of agent $i$, i.e., the better the associated signal-to-noise ratio for $i$. Therefore, higher learnability means it is easier for $i$ to achieve large values of its reward. Note that both learnability and factoredness are computed locally at a particular state. Later we analyze how these properties change through the state space.
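The point learnability ratio and its average over sampled alternatives can be sketched as follows. This is an illustrative sketch only: states are assumed to be tuples with one scalar entry per agent, the helper names are hypothetical, and returning infinity when the denominator is zero is an assumption about how to handle that case.

```python
import math


def swap_own(z, zp, i):
    """z - z_i + z'_i: other agents from z, agent i's state from z'."""
    return tuple(zp[k] if k == i else z[k] for k in range(len(z)))


def swap_others(z, zp, i):
    """z' - z'_i + z_i: other agents from z', agent i's state from z."""
    return tuple(z[k] if k == i else zp[k] for k in range(len(z)))


def point_learnability(g_i, z, zp, i):
    """Change in g_i caused by agent i over the change caused by the others."""
    own = abs(g_i(z) - g_i(swap_own(z, zp, i)))
    others = abs(g_i(z) - g_i(swap_others(z, zp, i)))
    # Assumption: a reward unaffected by the other agents is reported as inf.
    return math.inf if others == 0 else own / others


def learnability(g_i, z, alternatives, i):
    """Average point learnability over the sampled alternative states z'."""
    ratios = [point_learnability(g_i, z, zp, i) for zp in alternatives]
    return sum(ratios) / len(ratios)


# Toy example (hypothetical): a team-style reward versus a purely local one.
team = lambda z: sum(z)
local = lambda z: z[0]
z, zp = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)

print(point_learnability(team, z, zp, 0))
print(point_learnability(local, z, zp, 0))
```

The team-style reward is dominated by the other agents' changes, while the local reward does not react to them at all, so its ratio is unbounded.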


2.2. Multi-Agent Rewards 


The selection of a reward that provides the best performance hinges on balancing the degree of factoredness and learnability for each agent. In general, a highly factored reward will have low learnability and a highly learnable reward will have low factoredness [13]. In this work, we analyze three different rewards that provide different trade-offs between learnability and factoredness: $T_i$, the team game reward (Eq. 4); $P_i$, the perfectly learnable reward (Eq. 5); and $D_i$, the difference reward (Eq. 6), given by:


$$T_i = G(z) \qquad (4)$$

$$P_i = G(z_i) \qquad (5)$$

$$D_i = G(z) - G(z_{-i}) \qquad (6)$$


$T_i$ provides the full global reward to each agent. It is fully factored by definition, but because each agent's reward depends on the states of all the other agents, it generally has poor learnability, a problem that gets progressively worse as the size of the system grows. $P_i$ provides the component of the global reward that depends only on the state of agent $i$. Because it does not depend on the states of other agents, $P_i$ is “perfectly learnable”, having infinite learnability. However, depending on the domain, it may have a low degree of factoredness. $D_i$ provides rewards that have high factoredness, because the second term of Eq. 6 does not depend on $i$'s state [13]. Furthermore, $D_i$ usually has better learnability than does $T_i$, because the second term of $D_i$ removes some of the effects of other agents (i.e., noise) from $i$'s reward. While having good properties, this reward is often impractical to compute because it requires a lot of knowledge about $z$ to compute $G(z_{-i})$. In practice, any of the three rewards may be the best choice depending on their properties in a particular domain.
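The three rewards can be written generically for any global reward function. The sketch below assumes the state is a dictionary mapping agent ids to agent states and that $G$ accepts a state with agents missing; both are illustrative choices, and the "coverage" global reward is a hypothetical toy rather than the rover reward.

```python
def team_reward(G, z, i):
    """T_i: every agent receives the full global reward."""
    return G(z)


def perfectly_learnable_reward(G, z, i):
    """P_i: global reward evaluated on agent i's state alone."""
    return G({i: z[i]})


def difference_reward(G, z, i):
    """D_i: global reward minus the global reward with agent i removed."""
    z_without_i = {k: v for k, v in z.items() if k != i}
    return G(z) - G(z_without_i)


# Toy global reward (hypothetical): number of distinct sites being covered.
def coverage(z):
    return len(set(z.values()))


z = {0: "site_a", 1: "site_a", 2: "site_b"}

for i in z:
    print(i,
          team_reward(coverage, z, i),
          perfectly_learnable_reward(coverage, z, i),
          difference_reward(coverage, z, i))
```

Here agents 0 and 1 duplicate each other's work: the team reward cannot tell them apart from agent 2, the perfectly learnable reward credits all three equally, and only the difference reward assigns zero to the redundant agents.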

3. Continuous Rover Problem 




In this section, we define the “Continuous Rover Problem” that will be used to illustrate the importance of visualization and proper reward selection in difficult, noisy, continuous multi-agent domains. In this problem, multiple rovers try to observe points of interest (POIs) on a two-dimensional plane. A POI has a fixed position on the plane and has a value associated with it. The value of the information from observing a POI is inversely related to the distance the rover is from the POI. In this paper the distance metric will be the squared Euclidean norm, bounded below by a minimum observation distance, $d$:¹


$$\delta(x, y) = \max\{\|x - y\|^2, d^2\} \qquad (7)$$


While any rover can observe any POI, as far as the global reward is concerned, only the closest observation counts.² The full system, or global, reward for an episode is given by:

$$G = \sum_j \frac{V_j}{\min_i \delta(L_j, L_i)} \qquad (8)$$

where $V_j$ is the value of POI $j$, $L_j$ is the location of POI $j$ and $L_i$ is the location of rover $i$.
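As a minimal sketch, the distance metric and global reward can be computed as below. The sketch assumes that the squared distance is clamped from below at the square of the minimum observation distance, in line with its stated purpose of preventing singularities; the function names and the list-of-tuples data layout are illustrative.

```python
import math


def delta(a, b, d_min=5.0):
    """Squared Euclidean distance between 2-D points, never below d_min**2."""
    squared = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return max(squared, d_min ** 2)


def global_reward(pois, rovers, d_min=5.0):
    """Sum over POIs of value divided by the closest rover's delta.

    pois   -- list of (value, (x, y)) pairs
    rovers -- list of (x, y) rover locations (must be non-empty)
    """
    return sum(
        value / min(delta(location, rover, d_min) for rover in rovers)
        for value, location in pois
    )


# Hypothetical configuration: two POIs and two rovers on the x axis.
pois = [(3.0, (0.0, 0.0)), (10.0, (20.0, 0.0))]
rovers = [(0.0, 0.0), (10.0, 0.0)]

print(global_reward(pois, rovers))
```

The first rover sits on top of a POI, so its contribution is capped at the POI value divided by the squared minimum distance rather than diverging.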

At every time step, the rovers sense the world through eight continuous sensors. From a rover's point of view, the world is divided up into four quadrants relative to the rover's orientation, with two sensors per quadrant (see Figure 1). For each quadrant, the first sensor returns a function of the POIs in the quadrant. Specifically, the first sensor for quadrant $q$ returns the sum of the values of the POIs divided by their squared distance to the rover:

$$s_{1,q,i} = \sum_{j \in I_q} \frac{V_j}{\delta(L_j, L_i)} \qquad (9)$$

where $I_q$ is the set of observable POIs in quadrant $q$. The second sensor returns the sum of the inverse squared distances from a rover to all the other rovers in the quadrant:

$$s_{2,q,i} = \sum_{i' \in N_q} \frac{1}{\delta(L_{i'}, L_i)} \qquad (10)$$

where $N_q$ is the set of rovers in quadrant $q$.
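The eight sensor readings could be computed along the following lines. Several details are assumptions rather than facts from the text: quadrants are numbered counter-clockwise starting at the rover's heading, every POI is treated as observable, the distance is clamped from below at the minimum observation distance, and the rover sensor sums inverse squared distances so that larger readings mean closer rovers.

```python
import math


def delta(a, b, d_min=5.0):
    """Squared Euclidean distance, clamped from below at d_min**2."""
    return max((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2, d_min ** 2)


def quadrant(origin, heading, target):
    """Index 0..3 of the quadrant holding `target`, counted counter-clockwise
    from the rover's heading in radians (an assumed layout)."""
    angle = math.atan2(target[1] - origin[1], target[0] - origin[0]) - heading
    return int((angle % (2 * math.pi)) // (math.pi / 2)) % 4


def sense(i, rovers, headings, pois, d_min=5.0):
    """Return the 8 readings of rover i: 4 POI sensors, then 4 rover sensors."""
    me = rovers[i]
    poi_sensors = [0.0] * 4
    rover_sensors = [0.0] * 4
    for value, location in pois:
        q = quadrant(me, headings[i], location)
        poi_sensors[q] += value / delta(location, me, d_min)
    for k, other in enumerate(rovers):
        if k != i:
            q = quadrant(me, headings[i], other)
            rover_sensors[q] += 1.0 / delta(other, me, d_min)
    return poi_sensors + rover_sensors


# Hypothetical scene: one POI ahead-left of rover 0, one rover behind-left.
rovers = [(0.0, 0.0), (-10.0, 5.0)]
headings = [0.0, 0.0]
pois = [(3.0, (10.0, 10.0))]

print(sense(0, rovers, headings, pois))
```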


3.1. Simulation Set-up 

With four quadrants and two sensors per quadrant, there are a total of eight continuous inputs. This eight-dimensional sensor vector constitutes the state space for a rover.


¹ The squared Euclidean norm is appropriate for many natural phenomena, such as light and signal attenuation. However, any other type of distance metric could also be used as required by the problem domain. The minimum distance is included to prevent singularities when a rover is very close to a POI.

² Similar rewards could also be made where there are many different levels of information gain depending on the position of the rover. For example, 3-D imaging may utilize different images of the same object, taken by two different rovers.



Figure 1. Diagram of a Rover's Sensor Inputs. The world is broken up into four quadrants relative to the rover's position. In each quadrant one sensor senses points of interest, while the other sensor senses other rovers.


At each time step the rover uses its state to compute a two-dimensional action. The action represents an x, y movement relative to the rover's location and orientation. The mapping from state to action is done with a multi-layer perceptron (MLP), with 8 input units, 10 hidden units and 2 output units [6]. The MLP uses a sigmoid activation function, therefore the outputs are limited to the range (0, 1). The actions, $dx$ and $dy$, are determined by subtracting 0.5 from the output and multiplying by the maximum distance the rover can move in one time step: $dx = d(o_1 - 0.5)$ and $dy = d(o_2 - 0.5)$, where $d$ is the maximum distance the rover can move in one time step, $o_1$ is the value of the first output unit, and $o_2$ is the value of the second output unit. To better simulate the inaccuracies and imperfections of a rover operating in the real world, ten percent noise is added to each action. The MLP for a rover is chosen through local search [5], where the weights of the MLP are modified and selected with preset probabilities. Note that this is a form of direct policy search, where the MLPs are the policy [3].
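The state-to-action mapping can be sketched with a small 8-10-2 network written in plain Python. This is not the authors' code: biases are omitted, the "ten percent noise" is modeled as a uniform scaling of each action component by up to 10% in either direction, and the `mutate` step with its probability and scale is a hypothetical stand-in for the local search over weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(weights, state):
    """8-10-2 feed-forward pass; biases are omitted for brevity (assumption)."""
    w_hidden, w_out = weights
    hidden = [sigmoid(sum(w * s for w, s in zip(row, state))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]


def choose_action(weights, state, d=10.0, noise=0.1, rng=random):
    """Map the sensor state to (dx, dy) using dx = d * (o1 - 0.5), etc."""
    o1, o2 = mlp_forward(weights, state)
    dx, dy = d * (o1 - 0.5), d * (o2 - 0.5)
    # Assumption: "ten percent noise" is a uniform +/-10% scaling per component.
    return (dx * (1.0 + rng.uniform(-noise, noise)),
            dy * (1.0 + rng.uniform(-noise, noise)))


def random_weights(rng, n_in=8, n_hidden=10, n_out=2):
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w_hidden, w_out


def mutate(weights, rng, p=0.1, scale=0.5):
    """Hypothetical local-search step: perturb each weight with probability p."""
    return tuple(
        [[w + rng.gauss(0.0, scale) if rng.random() < p else w for w in row]
         for row in layer]
        for layer in weights
    )


rng = random.Random(0)
policy = random_weights(rng)
print(choose_action(policy, [0.1] * 8, rng=rng))
```

A direct policy search would evaluate each candidate policy over a full episode, keep or discard the mutated weights based on the agent reward it earned, and repeat.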

In these simulations, there are thirty rovers, and each episode consists of 15 time steps. The world is 100 units long and 115 units wide. All of the rovers start the episode near the center (60 units from the left boundary and 50 units from the top boundary). The maximum distance the rovers can move in one direction during a time step, $d$, is set to 10. The minimum observation distance $d$ used to compute $\delta$ in Eq. 7 is equal to 5. System performance is measured by how well the rovers are able to maximize the sum of global rewards for an episode, though each rover is trying to maximize its own agent reward, discussed below.


3.2. Rover Rewards 


In this paper three different types of agent rewards are tested in the Rover Problem. The first reward is the team game reward ($T_i$), where the agent reward is set to the global reward given in Equation 8. The second reward is the “perfectly learnable” reward ($P_i$):

$$P_i = \sum_j \frac{V_j}{\delta(L_j, L_i)} \qquad (11)$$

Note that $P_i$ is equivalent to $T_i$ when there is only one rover. It also has infinite learnability as defined in Section 2 (the denominator of Eq. 2 is equal to zero since for $P_i$, $g_i(z' - z'_i + z_i) = g_i(z)$). However, $P_i$ is not factored. Intuitively $P_i$ and $T_i$ offer opposite benefits, since $T_i$ is by definition factored, but has poor learnability. The third reward is the difference reward. It does not have as high learnability as $P_i$, but is still factored like $T_i$. For the rover problem, the difference reward, $D_i$, is defined as:

$$D_i = \sum_j \frac{V_j}{\min_{i'} \delta(L_j, L_{i'})} - \sum_j \frac{V_j}{\min_{i' \neq i} \delta(L_j, L_{i'})} = \sum_j I_{j,i}(z) \left[ \frac{V_j}{\delta(L_j, L_i)} - \frac{V_j}{\min_{i' \neq i} \delta(L_j, L_{i'})} \right] \qquad (12)$$

where $I_{j,i}(z)$ is an indicator function, returning one if and only if rover $i$ is the closest rover to $L_j$. The second term of $D_i$ is equal to the value of all the information collected if rover $i$ is not in the system. Note that in practice it may be difficult to compute this reward since each rover needs to know the locations of all of the other rovers. It may even be more difficult to compute than the team game reward, $T_i$, since $T_i$ is the same for all the rovers. In many cases $T_i$ can be computed once and then broadcast to all the agents.
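Applying the generic reward definitions of Section 2 to the rover global reward gives the following sketch. The names are illustrative, the squared distance is assumed to be clamped from below at the minimum observation distance, and treating the global reward of a system with no rovers as zero is an assumption needed for the single-rover case.

```python
import math


def delta(a, b, d_min=5.0):
    """Squared Euclidean distance, clamped from below at d_min**2."""
    return max((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2, d_min ** 2)


def global_reward(pois, rovers, d_min=5.0):
    """Each POI contributes its value over the closest rover's delta."""
    if not rovers:
        return 0.0  # assumption: nothing is observed when no rover is present
    return sum(v / min(delta(loc, r, d_min) for r in rovers) for v, loc in pois)


def team_reward(i, pois, rovers):
    """T_i: the global reward itself, identical for every rover."""
    return global_reward(pois, rovers)


def perfectly_learnable_reward(i, pois, rovers):
    """P_i: global reward as if rover i were the only rover."""
    return global_reward(pois, [rovers[i]])


def difference_reward(i, pois, rovers):
    """D_i: global reward minus the global reward with rover i removed."""
    others = rovers[:i] + rovers[i + 1:]
    return global_reward(pois, rovers) - global_reward(pois, others)


# Hypothetical scene: rover 0 is on the POI, rover 1 is 10 units away.
pois = [(3.0, (0.0, 0.0))]
rovers = [(0.0, 0.0), (10.0, 0.0)]

for i in range(len(rovers)):
    print(i, team_reward(i, pois, rovers),
          perfectly_learnable_reward(i, pois, rovers),
          difference_reward(i, pois, rovers))
```

Rover 1 is not the closest rover to the POI, so its difference reward is zero even though its perfectly learnable reward is positive; this is the kind of misalignment that the factoredness visualization is designed to expose.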


3.3. Static and Dynamic Environments 

In the static environment the set of POIs remained fixed for all learning episodes. The POI distributions ranged from randomly distributed across the state to checkerboard patterns of uniform POIs. The results and insights gained from visualization were qualitatively similar in all cases. To illustrate the impact of visualization, we selected the POI distribution depicted in Figure 2, which required a moderate amount of coordination. The 15 POIs to the left have a value of 3.0, and the lone POI to the right has a value of 10.0.

Figure 2. Diagram of Static Environment. Points of interest are at fixed locations for every episode.

For the dynamic environment, the POI distribution changed every 15 time steps, and the rovers faced a different configuration at each episode. In each episode, there were one hundred POIs of equal value, distributed randomly within a 70 by 70 unit square centered on the rovers' starting location. In the static environment, the rovers learned specific control policies for a given configuration of POIs. This type of learning is most useful when the rovers learn on a simulated environment that closely matches the environment in which they will be deployed. However, in general it is more desirable for the rovers to directly learn the sensor/action mapping independently from the specific POI configuration, so that they can generalize to POI configurations that may be significantly different from the ones in which they were trained. The dynamic environment experiment tests the rovers' ability to generalize in constantly changing environmental conditions.

This type of problem is common in real-world domains, where the rovers typically learn in a simulator and later have to apply their learning to the environment in which they are deployed. Note that this is a fundamentally difficult learning problem because: 1) the environment changes every episode, 2) noise is added to the actions of the rovers, 3) the state space is continuous, and 4) thirty rovers must coordinate. Therefore, the selection of the agent reward is critical to success and many rewards that can be used in more benign domains (e.g., grid world rovers) are unlikely to provide satisfactory results.

4. Reward Visualization 

Visualization is an important part of understanding the inner workings of many systems, but particularly those of learning systems [7, 4, 12, 1]. This paper focuses on visualizing reward properties to aid in both agent reward evaluation and design. To analyze the rewards in a specific domain, we plot the learnability and factoredness of a reward measured at a set of states in the domain. This visualization helps determine which of the many possible rewards one expects to perform well in a particular domain.

Figure 3. Factoredness and Learnability Visualization in Static Environment. First row shows factoredness of four rewards and second row shows their learnability. The visualization is a projection of an agent's state space, with increasing x values corresponding to states closer to POIs and increasing y values corresponding to states where the agent is closer to other agents. $P_i$ has low factoredness and is anti-factored for much of region 1. $D_i$ under partial observability ($D_i(PO)$) is much more factored. $D_i(PO)$ has higher learnability than $D_i$, especially in region 2. $T_i$ generally has low learnability, but is sufficient in region 3, corresponding to regions close to POIs.


The analysis starts by recording the states observed by agents taking a random set of actions for a fixed number of episodes. Then for each reward, we compute the learnability and factoredness by sampling Equations 1-3. The learnability and factoredness values for each state are then projected onto a two-dimensional plane, using a domain-dependent projection. The projection is then broken up into fixed-size squares and all the values within a square are averaged.
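The final averaging step can be sketched as a simple grid binning. The function name, the `(x, y, value)` input layout, and the cell size are illustrative assumptions; the values would be the sampled factoredness or learnability of each recorded state after projection.

```python
from collections import defaultdict


def bin_average(points, cell_size):
    """Average values that fall into the same square grid cell.

    points    -- iterable of (x, y, value) triples in projected coordinates
    cell_size -- side length of each square cell
    Returns a dict mapping (column, row) cell indices to the mean value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, value in points:
        cell = (int(x // cell_size), int(y // cell_size))
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical projected samples: two land in cell (0, 0), one in cell (1, 0).
samples = [(0.5, 0.5, 1.0), (0.7, 0.2, 0.0), (1.5, 0.5, 1.0)]
print(bin_average(samples, cell_size=1.0))
```

Cells that receive no samples are simply absent from the result, which corresponds to unsampled regions of the visualization.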

In a learnability visualization, points where an agent's action influences its reward more than the actions of other agents are represented with a “+” symbol. The lighter the “+” symbol, the more an agent influences its own reward. Points where an agent's actions influence its reward less than the actions of other agents are represented with a “-” symbol. The lighter the “-” symbol, the less an agent influences its reward. In a factoredness visualization, points where an agent's reward is aligned with the global reward more often than random are represented with a “+” symbol. The lighter the “+” symbol, the more factored the reward is. Points where an agent's reward is aligned with the global reward less often than random (anti-aligned) are represented with a “-” symbol. The lighter the “-” symbol, the more anti-factored the reward is.

In this domain, the projection axes are formed using the eight sensor values used by the rovers. The x axis of the projection corresponds to the sum of the four sensor values corresponding to POI distance, and the y axis corresponds to the sum of the four sensor values corresponding to other rover distance. Therefore values at the left side of the visualization correspond to states where a rover is far away from the POIs, and values at the right side of the visualization correspond to states where the rover is close to POIs. Similarly, values at the bottom of a visualization correspond to states where a rover is not close to any other rover, and areas towards the top of the visualization correspond to states where the rover is close to other rovers.
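The projection for this domain reduces an eight-dimensional sensor vector to two coordinates. The sketch below assumes the vector is ordered with the four POI sensors first and the four rover sensors last; that ordering and the function name are illustrative choices.

```python
def project(sensors):
    """Project an 8-value sensor vector onto the (x, y) visualization plane.

    Assumes sensors[0:4] are the POI sensors and sensors[4:8] are the
    rover sensors, one per quadrant.
    """
    if len(sensors) != 8:
        raise ValueError("expected eight sensor values")
    x = sum(sensors[:4])   # larger when the rover is close to POIs
    y = sum(sensors[4:])   # larger when the rover is close to other rovers
    return x, y


print(project([1.0, 2.0, 3.0, 4.0, 0.5, 0.5, 0.0, 0.0]))
```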

4.1. Visualization in Static Environments 

Figure 3 shows the learnability and factoredness visualizations for the static environment. $P_i$ is highly factored in some parts of the state space, particularly the lower right corner. That space corresponds to conditions where there are many POIs but few other rovers in the rover's vicinity. It is not surprising that in such conditions, where coordination is not relevant, this reward provides the right incentives. It is important to note that $P_i$ has high learnability across the board, a result that is expected from how the reward is constructed. While $P_i$ is therefore easy for the agents to learn, this visualization implies that in many states it results in the agent learning to take the wrong actions. Because $P_i$ has better factoredness than random for most states, we expect agents using $P_i$ in this environment to reach a reasonable level of proficiency.

The situation is almost entirely reversed for $T_i$ in this environment. It is by definition fully factored (except for states that have not been sampled, which show up as black in Figure 3), but has low learnability almost across the board. $T_i$ has good learnability only on the right side of the visualization, corresponding to states where the rover is close to the POIs. This is an important part of the state space, so we expect agents using $T_i$ to learn in this domain, though learning will be slow since the agents receive proper reinforcement signals only after they stumble upon regions with POIs.

$D_i$, on the other hand, is both fully factored and highly learnable. However, to compute $D_i$, a rover needs to be able to observe all of the other rovers, which may be impractical in many domains (note that $T_i$ also requires this). Instead we compute the partially observable $D_i$, where only the rovers within a radius equal to the maximum distance a rover can move in one time step are observed. This is a severe restriction that forces the agents to focus on less than 3% of the state space at any time in search of other rovers. While this reward is no longer fully factored, the factoredness visualization (labeled $D_i(PO)$) shows that the reward is still reasonably factored. In addition, if we look at the right side of the learnability visualization for $D_i$ and $D_i(PO)$ (vertical rectangle marked 2 in Figure 3), we see that $D_i(PO)$ is more learnable in this part of the state space. Considering that this part of the state space corresponds to the important area where a rover is close to a POI, we expect agents using $D_i(PO)$ to perform even better than agents using $D_i$ in this static environment.
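One way to read the partially observable difference reward is sketched below: the difference reward is evaluated as if the only rovers in the system were rover $i$ and those within its observation radius. This reading, the function names, the clamped distance, and the zero reward for an empty system are assumptions. The last lines check the "less than 3%" figure under the further assumption that it refers to the area of the observation disc relative to the 100 by 115 world.

```python
import math


def delta(a, b, d_min=5.0):
    """Squared Euclidean distance, clamped from below at d_min**2."""
    return max((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2, d_min ** 2)


def global_reward(pois, rovers, d_min=5.0):
    if not rovers:
        return 0.0  # assumption: nothing is observed when no rover is present
    return sum(v / min(delta(loc, r, d_min) for r in rovers) for v, loc in pois)


def difference_reward_po(i, pois, rovers, radius=10.0, d_min=5.0):
    """Difference reward computed from only the rovers that rover i can see."""
    me = rovers[i]
    visible_others = [
        r for k, r in enumerate(rovers)
        if k != i and math.dist(r, me) <= radius
    ]
    return (global_reward(pois, [me] + visible_others, d_min)
            - global_reward(pois, visible_others, d_min))


# Hypothetical scene: rover 2 is too far away to see the rovers near the POI.
pois = [(3.0, (0.0, 0.0))]
rovers = [(0.0, 0.0), (8.0, 0.0), (30.0, 0.0)]
print(difference_reward_po(0, pois, rovers))
print(difference_reward_po(2, pois, rovers))

# Area of a radius-10 disc relative to a 100 x 115 world.
observed_fraction = math.pi * 10.0 ** 2 / (100 * 115)
print(observed_fraction)
```

Rover 2 receives a small positive reward here although removing it would not change the true global reward, which illustrates how restricting observations trades some factoredness for a reward that is much cheaper to compute.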

For the static domain, we can gain additional insight into the differences between the rewards by displaying the factoredness visualization projected directly on the x, y domain in which the rovers move. This shows how the rewards map to actions directly taken by the rovers. Figure 4 shows the factoredness for $D_i(PO)$ and $P_i$ (on this projection, $T_i$ and $D_i$ are fully factored, meaning each square is a light “+”). Note that around the POIs, both rewards are factored. However, there is an anti-factored boundary for $P_i$ between the two regions. That means that agents are restricted to the right or left hand side of the x, y grid, and will not cross that boundary even if doing so would benefit the global reward. This means the performance of $P_i$ will be particularly sensitive to the initial random actions taken by the rovers. Notice that though not highly factored in that region, $D_i(PO)$ has two “bridges” to cross this region and furthermore is lightly factored rather than anti-factored in the rest of that region. This implies that $D_i(PO)$ will not have factoredness problems in this domain.

Figure 4. Factoredness Projected onto Domain Coordinates. Factoredness of $P_i$ and $D_i(PO)$ is projected onto the x, y coordinates of the domain environment instead of onto the feature space used by rovers. “+” represents factoredness and “-” represents anti-factoredness. $P_i$ has an anti-factored boundary preventing agents from moving from one region to the other.

4.2. Visualization in Dynamic Environments 

Figure 5 shows the factoredness and learnability visualizations for dynamic environments. They show that in this more difficult environment, neither $P_i$ nor $T_i$ is acceptable. The factoredness deficiencies of $P_i$ are amplified in this environment, as are the learnability deficiencies of $T_i$. In fact the learnability is so low that there is reason to expect $T_i$ to perform only marginally better than a random algorithm. $P_i$ is only consistently factored in the bottom left part of the visualizations, corresponding to unimportant locations where the rover is not close to any POIs or close to any other rover. In fact, in more important areas of the state space, $P_i$ is often anti-factored, leading one to expect agents using $P_i$ to perform very poorly in this environment.

In contrast, $D_i$ is both highly learnable and highly factored in this domain. In fact, there is little difference between the learnability/factoredness charts of $D_i$ in this difficult domain and in the static domain. Given that $D_i$ is fully factored we would expect rovers using $D_i$ to perform very well. However, again $D_i$ is difficult to compute in practice, as it requires a rover to know the locations of all of the other rovers. As in the static domain we can compute $D_i(PO)$, where the rover can only observe other rovers within a radius equal to the maximum distance it can move in one time step. Though not as high as that of $D_i$, the factoredness of $D_i(PO)$ is still consistently high. Therefore we expect rovers using $D_i(PO)$ to significantly outperform those using either $T_i$ or $P_i$.

Figure 5. Factoredness and Learnability Visualization in Dynamic Environments. First row shows factoredness of the four rewards and second row shows their learnability. The visualization is a projection of an agent's state space. The visualizations show that $P_i$ has very low factoredness and $T_i$ has very low learnability. $D_i(PO)$ (computed with partial observability) still has high factoredness.

5. Reward Performance 

In this section we show the results from a set of experiments in both the static and dynamic environments to evaluate the effectiveness of the rewards in these domains. The experiments confirm the expectations obtained from the factoredness and learnability visualizations.

Figure 6 shows results from the static environment. The rovers using $P_i$ learned quickly, but did not converge to good solutions. This is consistent with the high learnability/low factoredness properties of $P_i$ that were apparent in the visualizations. In contrast, agents using $T_i$ were able to keep improving their performance through learning, and were able to surpass the performance of $P_i$. However, as predicted from the learnability visualization, these rovers learn slowly, so $T_i$ may be a poor choice of reward if quick learning is needed. As expected, rovers using $D_i$ with full observability performed very well, since $D_i$ is both highly learnable and fully factored. More interestingly, the rovers using $D_i(PO)$ performed even better, though $D_i(PO)$ is not fully factored. This confirms that the gains in learnability more than offset the slight loss in factoredness shown in the visualizations. Note this is remarkable, since $D_i(PO)$ is in fact significantly easier to compute than $D_i$.

Figure 6. System performance in static environments. As predicted by the visualizations, agents using $P_i$ have mediocre performance, agents using $T_i$ learn slowly, and $D_i(PO)$ retains enough factoredness to perform well.



Figure 7 shows that rovers using $T_i$ or $P_i$ perform very poorly in the dynamic environments, as predicted from the learnability and factoredness visualizations. The performance of rovers using $P_i$ actually declines with learning, highlighting the fact that $P_i$ leads the rovers to learn the wrong thing. This result confirms the intuition that highly learnable but poorly factored rewards can in fact be worse than random actions in difficult environments requiring coordination. Rovers using $D_i$ with full observability performed the best and rovers using $D_i(PO)$ performed well. In this more difficult domain, $D_i(PO)$ did not have significant learnability gains over $D_i$, and therefore did not overcome the drop in factoredness. The $D_i(PO)$ results are still impressive though, as they are obtained by collecting only about 3% of the information about the location of other rovers compared to $D_i$.

Figure 7. Results in Dynamic Environments. As predicted from the visualization, agents using $T_i$ perform poorly, agents using $P_i$ perform even worse as they learn the wrong actions, agents using $D_i$ perform the best, and agents using $D_i(PO)$ perform quite well.


6. Conclusion 

The effectiveness of agent rewards in promoting coordination in a complex multi-agent system is heavily domain dependent. In many cases, rewards or coordination mechanisms that work well in static environments perform poorly in dynamic environments. This paper shows that the visualization of two critical reward properties can dramatically accelerate and reduce the difficulties associated with choosing good agent rewards and coordination mechanisms in difficult multi-agent problems. In addition, based on the visualization, the rewards can be modified to meet the computational and informational demands of a domain and then quickly validated. We demonstrate this capability by predicting the performance characteristics of a set of rewards in a noisy, continuous multi-rover domain, and show that some rewards that work reasonably well in the static environment fall apart in the dynamic environment. This visualization method is one to two orders of magnitude faster than running a full learning simulation to validate the agent reward. We used this visualization method to design and validate a reward based on a computationally expensive reward. This reward needed less than 3% of the observational capability of the full reward, but as predicted by the visualization performed nearly as well as the full reward in the dynamic environment.

References 

[1] A. Agogino, C. Martin, and J. Ghosh. Visualization of radial basis function networks. In Proceedings of International Joint Conference on Neural Networks, Washington, DC, 1999.

[2] A. Agogino and K. Tumer. Efficient evaluation functions for multi-rover systems. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2004), pages 1-12, Seattle, WA, 2004.

[3] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 968-974, Cambridge, MA, 1999. The MIT Press.

[4] H. Bischof, A. Pinz, and W. G. Kropatsch. Visualization methods for neural networks. In 11th International Conference on Pattern Recognition, pages 581-585, The Hague, Netherlands, 1992.

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.

[6] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, New York, 1994.

[7] G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185-234, 1986.

[8] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242-250, June 1998.

[9] M. J. Mataric. Coordination and learning in multi-robot systems. In IEEE Intelligent Systems, pages 6-8, March 1998.

[10] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[11] K. Tumer, A. Agogino, and D. Wolpert. Learning sequences of actions in collectives of autonomous agents. In Proceedings of the First International Joint Conference on Autonomous Agents and Multi-Agent Systems, pages 378-385, Bologna, Italy, July 2002.

[12] J. Wejchert and G. Tesauro. Visualizing processes in neural networks. IBM Journal of Research and Development, 35:244-253, 1991.

[13] D. H. Wolpert and K. Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2/3):265-279, 2001.


